Python Basics and Data Science Applications, Day 17 [Introduction to the NLTK Library]

2023 iThome Ironman Contest

DAY 17

SideProject30

Python Basics and Data Science Applications series, part 17

15th Ironman Contest

carsonleung

Team 沙培小子

2023-10-02 18:00:05

Day 17:

I hope this post gets a few more views, and that more readers improve after following my tutorials.

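What is NLTK?

The Natural Language Toolkit (NLTK) is a popular Python library for working with human language data. It provides tools and resources for tasks such as tokenization, stemming, part-of-speech tagging, and parsing.

To learn more, see my teammate's articles.

First, run the following command in a terminal to install the NLTK library.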

pip install nltk

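Installing the package does not include NLTK's corpora and pretrained models. The full collection is large, so rather than downloading everything at once, you usually fetch only what you need from a Python file with the code below. Calling nltk.download() with no arguments opens an interactive downloader where you can pick individual packages.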

import nltk
nltk.download()

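A Simple Chatbot

NLTK's chatbots perform simple pattern matching on the sentences the user types and reply with automatically generated sentences.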

import nltk

nltk.chat.eliza.eliza_chat()  # start the ELIZA chatbot

Output:

Therapist
---------
Talk to the program by typing in plain English, using normal upper-
and lower-case letters and punctuation.  Enter "quit" when done.
========================================================================
Hello.  How are you feeling today?
>

Try the other bots too, for example iesha, rude, and suntsu:

https://www.nltk.org/api/nltk.chat.html
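
To make the idea of pattern matching concrete, here is a minimal sketch in plain Python using only the standard `re` module. It is not NLTK's implementation: the `RULES` list, the `FALLBACK` reply, and the `respond` function are hypothetical names invented for this illustration, and the real ELIZA bot has many more rules and also swaps pronouns such as "my" and "your".

```python
import re

# Hypothetical rules for illustration only; these are not ELIZA's real rules.
# Each rule pairs a pattern with a reply template; {0} is filled with the
# text captured by the first group in the pattern.
RULES = [
    (re.compile(r"i need (.*)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"i am (.*)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"(hello|hi)\b.*", re.IGNORECASE), "Hello. How are you feeling today?"),
]
FALLBACK = "Please tell me more."


def respond(message: str) -> str:
    """Return the reply of the first rule whose pattern matches the message."""
    # Drop surrounding whitespace and trailing sentence punctuation.
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(*match.groups())
    return FALLBACK


print(respond("I need a holiday."))
print(respond("The weather is nice"))
```

The rules are tried in order, so more specific patterns should come before more general ones.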

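word_tokenize()

NLTK's word_tokenize() function splits text into tokens using a pretrained tokenizer. It first divides the text into sentences with the Punkt model, then applies a set of rules and heuristics to split each sentence into words and punctuation marks. It takes a text string as input and returns a list of tokens.

Example: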

import nltk

nltk.download('punkt')  # download the Punkt tokenizer data (NLTK 3.9 and later look for 'punkt_tab' instead)

text = "NLTK is a powerful library for AI chat bot"

tokens = nltk.word_tokenize(text)  # split the text into tokens

print(tokens)

Output:

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\leung\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'AI', 'chat', 'bot']
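
As a rough illustration of what rule-based tokenization means, the sketch below splits text with a single regular expression from the standard library. The function name `simple_tokenize` is made up for this example, and the rule is far cruder than what word_tokenize() does (for instance, it handles contractions such as "isn't" differently), so treat it as a teaching aid rather than a replacement.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      matches a run of letters, digits, or underscores
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("NLTK is a powerful library, right?"))
```

Like word_tokenize(), this keeps punctuation as separate tokens instead of discarding it.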

sent_tokenize()

The sent_tokenize() function in NLTK (the Natural Language Toolkit) splits the input text into sentences. It takes a text string as input and returns a list of sentences.

import nltk
# nltk.download('punkt')

# this passage was generated by Poe
text = """
The sun rose in the clear blue sky, casting its warm rays upon the vibrant green landscape. Birds chirped their melodious tunes as a gentle breeze rustled the leaves of the trees. The air was filled with the sweet scent of blooming flowers. People began to emerge from their homes, greeting the day with a sense of anticipation and purpose. Some embarked on their daily routines, heading to work or school, while others sought adventure and exploration. Children played joyfully in the park, their laughter echoing through the air. It was a day brimming with possibilities, promising new experiences and memories to be made.
"""

sentences = nltk.sent_tokenize(text)  # sentences is a list of strings
for sentence in sentences:  # print each sentence on its own line
    print(sentence)

Output:

The sun rose in the clear blue sky, casting its warm rays upon the vibrant green landscape.
Birds chirped their melodious tunes as a gentle breeze rustled the leaves of the trees.
The air was filled with the sweet scent of blooming flowers.
People began to emerge from their homes, greeting the day with a sense of anticipation and purpose.
Some embarked on their daily routines, heading to work or school, while others sought adventure and exploration.
Children played joyfully in the park, their laughter echoing through the air.
It was a day brimming with possibilities, promising new experiences and memories to be made.
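
One way to see why a trained sentence splitter is useful is to compare it with the naive approach of splitting after every period, question mark, or exclamation mark. The sketch below uses only the standard library; `naive_sent_split` is a hypothetical helper written for this comparison, not an NLTK function.

```python
import re


def naive_sent_split(text: str) -> list[str]:
    """Split wherever whitespace follows '.', '!' or '?'."""
    # The lookbehind keeps the punctuation attached to the preceding sentence.
    return re.split(r"(?<=[.!?])\s+", text.strip())


# Works for plain sentences ...
print(naive_sent_split("The sun rose. Birds chirped."))
# ... but wrongly breaks after the abbreviation "Dr."
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

The second call splits "Dr." into its own piece, which is the kind of mistake that abbreviation-aware models such as Punkt are designed to avoid.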

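What is pos_tag()?

pos_tag() takes a list of tokens and returns a list of (word, tag) pairs, where each tag is a code for the word's part of speech.

The table below explains what some of the tags mean.

Tag	Meaning
NNP	Proper noun, singular
VBZ	Verb, third person singular present
JJ	Adjective
DT	Determiner
NN	Noun, singular or mass
IN	Preposition or subordinating conjunction

For the full list, see the following page:
https://blog.csdn.net/JasonJarvan/article/details/79955664

Example: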

import nltk
# nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')  # download the tagger model (NLTK 3.9 and later look for 'averaged_perceptron_tagger_eng' instead)

text = "NLTK is a powerful library in Python."
tokens = nltk.word_tokenize(text)  # split the text into tokens
pos_tags = nltk.pos_tag(tokens)  # tag each token with its part of speech

print(pos_tags)

Output:

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\leung\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), ('library', 'NN'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]
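
Because the result is an ordinary list of (word, tag) tuples, it can be processed with plain Python. As a small sketch, the code below copies the output shown above into a literal (so it runs without NLTK) and groups the words by tag; the variable names `words_by_tag` and `proper_nouns` are chosen just for this example.

```python
from collections import defaultdict

# The tagged tokens printed in the example above, copied as a literal.
pos_tags = [
    ("NLTK", "NNP"), ("is", "VBZ"), ("a", "DT"), ("powerful", "JJ"),
    ("library", "NN"), ("in", "IN"), ("Python", "NNP"), (".", "."),
]

# Group the words by their part-of-speech tag.
words_by_tag = defaultdict(list)
for word, tag in pos_tags:
    words_by_tag[tag].append(word)

proper_nouns = words_by_tag["NNP"]
print(proper_nouns)          # the proper nouns in the sentence
print(words_by_tag["JJ"])    # the adjectives in the sentence
```

The same pattern can be used to pull out only nouns, verbs, or any other tag you care about.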

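Stemming Algorithms

Calling ps.stem(), where ps is a PorterStemmer instance, reduces a word to its base or root form. Stemming helps normalize words by stripping affixes; the Porter algorithm removes suffixes only. The following example shows how to stem words with the NLTK library: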

# import the stemmer
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# choose some words to be stemmed
words = ["running", "jumps", "jumping", "played", "playing", "program", "programs", "programmer", "programming", "programmers"]

for w in words:
    print(w, " : ", ps.stem(w))  # print each word next to its stem

Output:

running  :  run
jumps  :  jump
jumping  :  jump
played  :  play
playing  :  play
program  :  program
programs  :  program
programmer  :  programm
programming  :  program
programmers  :  programm
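
To show the basic idea behind suffix stripping, here is a deliberately tiny sketch in plain Python. It is not the Porter algorithm: `SUFFIXES` and `toy_stem` are hypothetical names, only three suffixes are handled, and many English words will be stemmed incorrectly. The real PorterStemmer applies a much longer, ordered series of rules.

```python
# Suffixes to strip, checked in this order (a toy list, not Porter's rules).
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip one known suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "jumps", "played", "programming", "programmer"]:
    print(w, " : ", toy_stem(w))
```

Note that "programmer" is left unchanged here because "er" is not in the toy suffix list, whereas PorterStemmer reduces it to "programm" as shown in the output above.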

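Feel free to try these out yourself. That's all for today. If you found this article helpful or have suggestions for improvement, follow me and leave a comment. See you tomorrow.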